Machine Learning Functions

The following is a brief guide to the ML functions included in the Data Flow section of Model.

Clustering

Clustering is the task of grouping a set of objects so that objects in the same group (called a cluster) are more similar to each other, in some sense, than to those in other groups (clusters).

DBSCAN

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a data clustering algorithm. It groups together points that lie close to one another, based on the distance between them.
Use it when the number of clusters is unknown; it is generally accurate in that setting. For example, it can be applied to indoor location data to understand the number of rooms, common locations, etc.
Requires the minimum number of neighbors and the maximum distance to a neighbor.
Analyze with a scatter or bubble graph colored by cluster number. A cluster may indicate a group of similar points that can later be filtered and analyzed further. The number of groups can itself suggest a different approach to analyzing the data. Data points that have no cluster can be treated as outliers.
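
The two DBSCAN parameters above can be illustrated with a minimal pure-Python sketch. It is not the product's implementation; the names dbscan, eps and min_pts are chosen here for illustration only, and a point counts as its own neighbor.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster (outliers)


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


# Two dense groups and one isolated point (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

With this data the isolated point (50, 50) receives the NOISE label, which corresponds to the "no cluster" outliers described above.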

EMMD

Expectation Maximization Mixed Data (EMMD) is a clustering method based on probabilities. It supports numerical and categorical data.
Use with mixed data (numeric and categorical) when the number of clusters is unknown. For example, grouping product color together with sales, state and expenses may show that red is similar to yellow in 3 different states where sales and expenses are both high.
Requires an upper limit for the number of clusters.
Analyzing this kind of clustering involves several graphs and several slices. Another use is finding outliers in the dependent mixed data space.

Hierarchical clustering

Hierarchical clustering groups data over a variety of scales by creating a cluster tree. Clusters at one level of the tree are joined as clusters at the next level. This allows you to decide the level or scale of clustering that is most appropriate for the application.
This algorithm also supports both numeric and categorical data.
Use with mixed data and small datasets. The use cases are similar to those of EMMD, but it only works with small datasets and produces more accurate results.
Requires the number of clusters.
Analyzing this kind of clustering involves several graphs and several slices.

K-means

k-means clustering aims to partition numeric observations into k clusters (chosen or estimated) in which each observation belongs to the cluster with the nearest mean.
Use for numerical data. For example, k-means can be used for geo-clustering to find someone's home and work locations from latitude and longitude alone.
Optionally requires the number of clusters. If it is not specified, the number of clusters is determined by the elbow method.
Analyze with a scatter, bubble or map graph colored by cluster number. A cluster may indicate a group of similar points that can later be filtered and analyzed further.
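
The "nearest mean" idea can be sketched as follows. This is a plain version of Lloyd's algorithm written for illustration, not the product's implementation; the function name kmeans, the choice of the first k points as starting centers, and the coordinates are all assumptions.

```python
from math import dist


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm. Returns (labels, centers)."""
    centers = [tuple(p) for p in points[:k]]  # naive deterministic start (assumption)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


# Hypothetical (latitude, longitude) samples around two locations.
home = [(32.080, 34.780), (32.081, 34.781), (32.079, 34.779)]
work = [(32.110, 34.840), (32.111, 34.841), (32.109, 34.839)]
labels, centers = kmeans(home + work, k=2)
```

The two returned centers approximate the home and work locations. Real implementations usually use a smarter initialization than the first k points.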

PAM

Partitioning Around Medoids (PAM) partitions the data into k clusters "around medoids". It is a more robust version of k-means.
Can be used with noisy datasets (a large number of outliers, etc.). Its advantage over k-means is that the cluster location is less affected by outliers.
Requires the number of clusters.
Analyze with a scatter, bubble or map graph colored by cluster number. A cluster may indicate a group of similar points that can later be filtered and analyzed further.
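
The robustness claim can be seen on a single one-dimensional cluster. The sketch below is not a full PAM implementation; it only compares the mean (the k-means center) with the medoid (the member with the smallest total distance to all other members). The function names and the data are illustrative.

```python
def mean(values):
    # Center used by k-means: the arithmetic average.
    return sum(values) / len(values)


def medoid(values):
    # Center used by PAM: the actual member that minimizes the
    # total distance to every other member.
    return min(values, key=lambda m: sum(abs(m - v) for v in values))


cluster = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is an outlier
center_mean = mean(cluster)      # 22.0, pulled toward the outlier
center_medoid = medoid(cluster)  # 3.0, stays inside the dense part
```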

Canopy

Canopy clustering is a very fast numeric clustering algorithm. It is used as a preprocessing step for other clustering algorithms or for speeding up clustering operations on large datasets.
Use when the number of clusters is unknown and the dataset is large, such as when grouping different activities based on accelerometer data.
Analyze with a scatter or bubble graph colored by cluster number. A cluster may indicate a group of similar points that can later be filtered and analyzed further. The number of groups can itself suggest a different approach to analyzing the data. Data points that have no cluster can be treated as outliers.
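
A rough sketch of the general canopy technique follows. It assumes the commonly described two-threshold form (a loose distance t1 and a tight distance t2, with t1 > t2); these parameters are not mentioned in this guide and the product may expose them differently or not at all.

```python
from math import dist


def canopy(points, t1, t2):
    """Return a list of (center index, member indices) canopies."""
    if not t1 > t2 >= 0:
        raise ValueError("expected t1 > t2 >= 0")
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]
        # Every point within the loose threshold joins this canopy;
        # canopies are allowed to overlap.
        members = [
            i for i in range(len(points))
            if dist(points[center], points[i]) <= t1
        ]
        canopies.append((center, members))
        # Points within the tight threshold can no longer start a canopy.
        remaining = [
            i for i in remaining if dist(points[center], points[i]) > t2
        ]
    return canopies


points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]  # illustrative data
canopies = canopy(points, t1=3.0, t2=1.5)
```

Each pass needs only one distance computation per point, which is why the method is cheap enough to run before a more expensive algorithm.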

Classifiers - Prediction

Classification is the problem of identifying to which of a set of categories (labels) a new observation belongs, based on its features and on a training set of data containing observations (or instances) whose category membership is known.

KNN

K-Nearest Neighbors (KNN) is a classifier for categorical values, based on numeric distance.
Uses numeric data as the feature vector and categorical data as labels. For example, it can predict whether someone will buy a certain item based on the number of times he/she saw a commercial, yearly income and number of previous purchases in an online store.
Requires the number of neighbors (K).
Predict on new sample data to estimate whether a potential buyer will buy.
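
As a minimal sketch of the idea, the K closest training rows vote on the label of a new row. The function name knn_predict and the training data are made up for illustration; in practice the features would normally be scaled first so that income does not dominate the distance.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k):
    """train is a list of (feature vector, label) pairs."""
    # Keep the k training rows closest to the query.
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]
    # Majority vote among their labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical rows: (times saw commercial, income in thousands, previous purchases)
train = [
    ((1, 30, 0), "no"), ((2, 35, 1), "no"), ((3, 32, 0), "no"),
    ((8, 80, 5), "yes"), ((9, 85, 6), "yes"), ((7, 90, 4), "yes"),
]
likely_buyer = knn_predict(train, (8, 82, 5), k=3)
unlikely_buyer = knn_predict(train, (2, 31, 0), k=3)
```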

Naïve Bayes

Naïve Bayes is a multiclass classification algorithm with the assumption of independence between every pair of features.
Uses categorical data as the feature vector and categorical data as labels. For example, it can be used to predict whether someone will play tennis based on Outlook (sunny/overcast/raining), Temperature (hot/mild/cold), Humidity (high/normal) and Windy (true/false).
Requires λ (smoothing parameter) for handling sparse data or values not seen in training (such as unknown words).
Predict whether your father will go play tennis or stay home.
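
The role of λ can be sketched with a small categorical Naïve Bayes written from scratch. The code below is an illustration under assumptions: the function names, the additive form of the smoothing, and the eight training rows are all invented here and are not taken from the product.

```python
from collections import Counter, defaultdict
from math import log


def train_nb(rows, labels):
    label_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    distinct = defaultdict(set)          # feature index -> values seen in training
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            distinct[i].add(value)
    return label_counts, value_counts, distinct


def predict_nb(model, row, lam=1.0):
    label_counts, value_counts, distinct = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in label_counts.items():
        score = log(n / total)  # log prior
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            # Additive smoothing: lam keeps unseen values from zeroing the product.
            score += log((count + lam) / (n + lam * len(distinct[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up rows: (outlook, temperature, humidity, windy)
rows = [
    ("sunny", "hot", "high", "false"), ("sunny", "hot", "high", "true"),
    ("overcast", "hot", "high", "false"), ("raining", "mild", "high", "false"),
    ("raining", "cold", "normal", "false"), ("raining", "cold", "normal", "true"),
    ("overcast", "cold", "normal", "true"), ("sunny", "mild", "high", "false"),
]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes", "no"]
model = train_nb(rows, labels)
plays = predict_nb(model, ("overcast", "mild", "normal", "false"))
stays = predict_nb(model, ("sunny", "hot", "high", "true"))
```

Because each feature contributes its own independent factor, the score is a simple sum of logarithms, which reflects the independence assumption stated above.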

Click here to learn more about Naïve Bayes.

Decision tree

In a decision tree (as a predictive model), each column represents a branch. To predict, follow the tree for each row, starting from the root and moving down. Slow but accurate.
Uses mixed data for branches (feature vector) and categorical data for the prediction (labels). For example, it can predict purchase or no purchase based on mixed data (age, education, height, home ownership) with a small dataset.
Predict on new sample data to estimate whether a potential buyer will buy.

Click here to learn more about Decision Trees.

Random forest

Random forests are ensembles of decision trees. They combine many small decision trees, each built from a random sample, in order to reduce the risk of overfitting.
Uses mixed data for the feature vector and categorical data as labels. For example, a random forest could be used to predict whether someone will have the flu based on attributes such as smoker, diabetes, alcohol, vegetarian, lives in the city, etc.
Predict who should get a vaccine.

Shallow Neural Net

Artificial neural nets are computing systems inspired by the biological neural networks that constitute animal brains. A shallow neural net consists of one hidden layer.
Uses mixed data for the feature vector and categorical data as labels. Can be used for image recognition, for example to determine whether a user profile picture shows a male or a female.
Add a gender tag for all users and save a question in the sign-up form.

Support Vector Machine (SVM)

A Support Vector Machine is a classification algorithm that maps the data to points in space such that the gap between points of different categories is as wide as possible.
Uses numeric data for the feature vector and categorical data as labels. For example, it can predict whether someone will buy a certain item based on the number of times he/she saw a commercial, yearly income and number of previous purchases in an online store.

Regression

Regression tree

A regression tree estimates a numeric value using a tree.
Uses mixed data for the feature vector and numeric data as the target. Can be used for either cleaning data or estimating numeric values.
Requires the depth and width of the tree.
Estimate sales based on new data.

Click here to learn more about Regression Trees.

Interpolation

Linear interpolation

Linear interpolation is a mathematical method of modeling the data by a set of straight line segments, each connecting two sample data points.
Can be used for smoothing inaccurate data that inherently contains noise.
Preferable when the data behaves like a set of straight line segments.
Get smoothed values of monthly expenses data.
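
As a short illustration, a value between two samples is read off the straight line that joins them. The function name and the expense figures below are hypothetical, and the sketch assumes the x values are sorted in strictly increasing order.

```python
from bisect import bisect_right


def linear_interpolate(xs, ys, x):
    """Evaluate the piecewise-linear curve through (xs, ys) at x."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the sampled range")
    # Index of the right end of the segment that contains x.
    i = min(bisect_right(xs, x), len(xs) - 1)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    t = (x - x0) / (x1 - x0)  # position inside the segment, from 0 to 1
    return y0 + t * (y1 - y0)


months = [1, 2, 3, 4]
expenses = [100.0, 120.0, 110.0, 150.0]  # made-up monthly expenses
mid_month = linear_interpolate(months, expenses, 2.5)  # halfway between 120 and 110
```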

Polynomial interpolation

Polynomial interpolation is a mathematical method of modeling the data by a polynomial that connects the sample data points.
Can be used for smoothing inaccurate data that inherently contains noise.
Preferable when the data behaves like a polynomial.
Get smoothed values of monthly expenses data.
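
One standard way to evaluate the polynomial through a set of points is the Lagrange form, sketched below. Whether the product uses this form is not stated in this guide; the function name and the sample data are illustrative, and the x values are assumed to be distinct.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the unique polynomial through the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # Basis term: equals yi at xi and 0 at every other sample point.
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Samples taken from y = x^2 + x + 1.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 7.0]
between = lagrange_interpolate(xs, ys, 1.5)
```

Three points define a quadratic, so the sketch recovers the original curve between the samples. With many noisy points a single high-degree polynomial can oscillate strongly, which is one reason piecewise methods are often preferred.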

Splines

Splines are a mathematical method of modeling the data by a set of polynomials (splines), each connecting two data points.
Can be used for smoothing inaccurate data that inherently contains noise.
Get smoothed values of monthly expenses data.

NEXT: Learn about the Machine Learning Nodes.